Wine Quality

Dataset Credits:

http://www3.dsi.uminho.pt/pcortez/wine/

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Features:

Outcome Variable:

Sanity Check

Exploring Data

We combined two different but very similar data sets: one covering red wines and one covering white wines. They are very similar because they share exactly the same columns.

Although we combined two different data sets, the rating distributions across wine qualities show no particular differences between the two.
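The merge described above can be sketched with pandas. The project reads the two UCI CSV files (e.g. `pd.read_csv("winequality-red.csv", sep=";")`); tiny inline frames stand in for them here, and the variable names are illustrative:

```python
import pandas as pd

# Stand-ins for the two UCI frames; the real ones share exactly the same columns.
red = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
white = pd.DataFrame({"alcohol": [8.8, 9.5], "quality": [6, 6]})

# Tag each row with its origin before concatenating; this "type" column
# is what the classification section later tries to predict.
red["type"] = "red"
white["type"] = "white"

wine = pd.concat([red, white], ignore_index=True)
print(wine.shape)
```

Concatenation works cleanly here precisely because the column sets are identical; otherwise `concat` would introduce missing values.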

Linear Regression

Purpose

The intent of this work is to predict the quality of a wine based on its organoleptic characteristics.

Making a guess

Assuming that there can be a direct correlation between a single characteristic and wine quality, can we already identify which characteristic might help us predict the value of quality?

These indices cannot capture correlations between quality and combinations of multiple characteristics; such relationships will have to be learned by the model we build.

If we look at the heat map below, we can identify which features have a higher correlation index. But they may not necessarily be the ones relevant to the prediction.
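The matrix behind such a heat map is just the pairwise Pearson correlation of the columns. A minimal sketch, using fabricated values in place of the real wine frame (the column names are real, the numbers and coefficients are not):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for the combined frame: alcohol drives quality weakly
# and density strongly tracks alcohol, purely for illustration.
alcohol = rng.normal(10.5, 1.0, n)
density = 1.0 - 0.003 * alcohol + rng.normal(0, 0.001, n)
quality = (3 + 0.3 * alcohol + rng.normal(0, 1.0, n)).round()
df = pd.DataFrame({"alcohol": alcohol, "density": density, "quality": quality})

# Pairwise Pearson correlations; seaborn's heatmap would plot this matrix.
corr = df.corr()

# Rank features by absolute correlation with quality, as done when "making a guess".
print(corr["quality"].drop("quality").abs().sort_values(ascending=False))
```

Ranking by absolute value matters: a strongly negative correlation is just as informative for a single-feature guess as a strongly positive one.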

Linear Regression

Ridge

We varied the regularization parameter to find the best value of alpha.

The best alpha for Ridge is 1.12. What stands out is the enormous preponderance of one feature over the others.

Our conjecture was wrong. Instead, the feature that seems to prevail is density, by an overwhelming margin. The MSE is about 0.53, a relatively low error. Let us now see what happens with Lasso.
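The alpha search described above can be sketched with scikit-learn's `RidgeCV`, which tries each candidate alpha by cross-validation and keeps the best one. Fabricated features stand in for the physicochemical columns, and the alpha grid is an assumption, not the one used in the report:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Fabricated data: four features, only some of which matter.
X = rng.normal(size=(300, 4))
y = X @ np.array([0.5, -1.2, 0.1, 0.0]) + rng.normal(0, 0.5, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Assumed grid of candidate regularization strengths.
alphas = np.logspace(-3, 2, 50)
model = RidgeCV(alphas=alphas).fit(X_tr, y_tr)

mse = mean_squared_error(y_te, model.predict(X_te))
print(model.alpha_, round(mse, 3))
```

With the real data, `model.coef_` is where the dominance of a single feature (density) becomes visible.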

Lasso

Density is also the most relevant feature for our Lasso model. The MSE is again about 0.53, a good value for this task.
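Lasso's L1 penalty makes the "one dominant feature" pattern especially visible, because it shrinks irrelevant coefficients all the way to zero. A minimal sketch with `LassoCV` on fabricated data where only the first feature drives the target, mimicking the role density plays in the real frame:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
# Fabricated data: only the first of five features truly matters.
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] + rng.normal(0, 0.3, 300)

# LassoCV selects alpha by cross-validation; the L1 penalty zeroes out
# the coefficients of the uninformative features.
model = LassoCV(cv=5, random_state=0).fit(X, y)
print(np.round(model.coef_, 2))
```

Inspecting `model.coef_` on the real data is how one reads off which physicochemical property dominates.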

Classification

We then tried to predict the type of wine, classifying each sample as white or red.

Decision Tree

The data collected from the classifier report indicate that:

The prediction made with the decision tree is particularly good: most of the results lie on the diagonal of the confusion matrix. Classifying by wine type is easy for the model, perhaps because red and white wines take on very different characteristics.
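The tree-based classification can be sketched as follows. Fabricated, well-separated two-class data stands in for the red/white split, echoing how strongly the two wine types differ in the real frame:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

rng = np.random.default_rng(0)
# Two fabricated classes standing in for red and white wines.
X = np.vstack([rng.normal(0.0, size=(150, 3)),
               rng.normal(3.0, size=(150, 3))])
y = np.array(["red"] * 150 + ["white"] * 150)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Most counts land on the diagonal when the classes are this separable.
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)
print(classification_report(y_te, clf.predict(X_te)))
```

`classification_report` is what produces the precision/recall/f1-score table referred to above.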

K-Neighbors

The data collected from the classifier report indicate that:

In general, we can say that this classifier also works well, coming very close to the decision tree in terms of accuracy.

For K-Neighbors the classification also works well: the confusion matrix shows a concentration of values on the TP-TN diagonal compared to the off-diagonal FP and FN cells. At first glance it seems to perform slightly worse than the previous classifier (Decision Tree), but comparing the ratios described above (precision, recall and f1-score) shows that the two are roughly equivalent.
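The K-Neighbors comparison can be sketched in the same way as the tree, swapping in `KNeighborsClassifier`. The choice of `n_neighbors=5` here is simply scikit-learn's default, an assumption rather than the value used in the report:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Same kind of fabricated, well-separated two-class data as in the tree sketch.
X = np.vstack([rng.normal(0.0, size=(150, 3)),
               rng.normal(3.0, size=(150, 3))])
y = np.array([0] * 150 + [1] * 150)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# k nearest neighbours vote on the label of each test point.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
acc = accuracy_score(y_te, knn.predict(X_te))
print(round(acc, 3))
```

Comparing this accuracy (and the per-class precision/recall) against the tree's is exactly the equivalence check made above.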

Conclusion

We considered two very similar data sets, one of white wines and one of red wines, combined them and created a "type" column that we then used for classification.

Looking at the correlation indices, we initially noticed a high correlation between wine quality and alcohol content, only to find that it was instead density that proved particularly useful in determining the quality of a wine.

Through linear regression, our model predicts wine quality with a mean squared error of ~0.532 using both the Ridge and Lasso regressors. Finally, we compared two classifiers (K-Neighbors and Decision Tree) to try to accurately determine the type of wine (white or red).